Inferring Ideology from Incidents:
An Analysis of the Global Terrorism Database
By Tyler A. Clark

Introduction

In 2019, there were nearly 8,500 terrorist attacks around the world, killing more than 20,000 people (Source: Global Terrorism Overview). To mitigate and combat terrorism successfully, it is imperative to understand the complex geopolitical dynamics that enable terrorism and terrorist ideologies. This project aims to analyze what one can infer about terrorist ideologies from data about their attacks. If one can identify causal relationships between characteristics of terrorists, including their ideologies and group structure, and the types of attacks they perpetrate, then perhaps we could create informed policy to predict and mitigate terrorism.

Before proceeding, it is necessary to define key terms. Terrorism is notoriously difficult to define, and there is little agreement on a definition across industry, academia, and government. We will use data from the Global Terrorism Database (GTD) as the basis for our analysis, and will therefore use the definitions of terrorism and terrorist attacks provided in the dataset's codebook:

A terrorist attack is the threatened or actual use of illegal force and violence by a non-state actor to gain political, economic, religious, or social goals through fear, coercion, or intimidation.

(Source: GTD Codebook)

It is worth analyzing this definition to gain a better understanding of what is and what is not terrorism for the purposes of this project. The codebook indicates that terrorist attacks must be intentional. That does not mean that the attack is carried out exactly as intended, but rather that there is an intended target, a method by which to inflict harm, and perhaps evidence of planning. Additionally, a terrorist attack must include violence or the immediate threat of violence, against either people or property. Violence in the codebook means the intention to cause injury and/or irrevocable destruction or kinetic damage. It is worth noting that the perpetrators must be sub-national actors. The database does not include acts of state terrorism, that is, attacks by persons who are employed by the state and/or are acting on behalf of a state or nation. This criterion does not exclude state-sponsored attacks, only attacks perpetrated by state actors.

In addition to meeting the definition above, an incident must satisfy at least two of the following three criteria for inclusion in the dataset:

1. The act must be aimed at attaining a political, economic, religious, or social goal. In terms of economic goals, the exclusive pursuit of profit does not satisfy this criterion. It must involve the pursuit of more profound, systemic economic change.
2. There must be evidence of an intention to coerce, intimidate, or convey some other message to a larger audience (or audiences) than the immediate victims. It is the act taken as a totality that is considered, irrespective if every individual involved in carrying out the act was aware of this intention. As long as any of the planners or decision-makers behind the attack intended to coerce, intimidate or publicize, the intentionality criterion is met.
3. The action must be outside the context of legitimate warfare activities. That is, the act must be outside the parameters permitted by international humanitarian law.

For additional explanation of these criteria, as well as examples, please see the GTD Codebook.

Data Collection

Thankfully, the most tedious part of the data science pipeline has been done for us. The researchers at the National Consortium for the Study of Terrorism and Responses to Terrorism (START) have amalgamated data on global terrorism incidents from 1970 to 2019 in their Global Terrorism Database. The database, informed by open-source media articles, contains more than 100 structured variables that characterize each attack’s location, tactics and weapons, targets, perpetrators, casualties and consequences, and general information such as definitional criteria and links between coordinated attacks. Unstructured variables include summary descriptions of the attacks, more detailed information on the weapons used, and specific motives of the attackers. The GTD is accessible to individuals and organizations from the START website: https://start.umd.edu/gtd/.

While the methodology for collecting data has evolved since the inception of the database in 2006, it is worth mentioning the hybrid workflow the GTD employs to collect, process, and publish data today. The process starts with a pool of more than two million open-source media reports published each day. The GTD team combines automated and human workflows, leveraging the strengths and mitigating the limitations of each, to produce rich and reliable data. On the automated side, GTD researchers use Boolean filters on articles, natural language processing (NLP), deduplication of articles, location identification, clustering of similar articles, and machine learning (ML) models to identify the relevance of articles. After the automated process has gathered, filtered, and labeled articles, a team of analysts triages the articles to assess source validity, apply inclusion criteria, and create narratives of single incidents from multiple sources. The incidents are then coded by smaller teams with specific domain expertise.

Data Processing

After creating an individual-use account for the GTD, we download the dataset and import it as a dataframe using pandas.
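The import step might look like the following minimal sketch. The helper name `load_gtd` and the file name in the commented usage line are assumptions made for illustration; the actual name and format of the downloaded file may differ.

```python
from pathlib import Path

import pandas as pd


def load_gtd(path):
    """Load a GTD export into a dataframe, choosing the reader by file extension."""
    path = Path(path)
    if path.suffix.lower() in {".xlsx", ".xls"}:
        # Reading Excel files requires an engine such as openpyxl to be installed.
        return pd.read_excel(path)
    # low_memory=False parses the file in one pass so that columns with
    # mixed types are inferred consistently rather than chunk by chunk.
    return pd.read_csv(path, low_memory=False)


# Hypothetical file name; replace it with the path of the downloaded file.
# gtd = load_gtd("globalterrorismdb.xlsx")
```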

Perhaps the first thing to note is that this dataframe is quite large to be manipulating in a Jupyter Notebook. It contains over 200,000 incidents and 135 columns, and takes up about 100MB of memory. A dataset of this size may not be considered "big data", but it warrants careful consideration of how we analyze the data to avoid long wait times and computational inefficiency. First we are going to "clean" the data by taking a subset of the columns that we will use for our analysis. The resulting dataframe will be easier to read and faster to iterate over and operate on.
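A minimal sketch of this column-subsetting step, using the column names described in this section. The helper names `subset_columns` and `memory_mb` are illustrative, and the toy dataframe (with a hypothetical `summary` column standing in for the dropped text fields) only stands in for the full GTD dataframe.

```python
import pandas as pd

# Columns kept for the analysis (the names described in this section).
KEEP_COLUMNS = [
    "iyear", "imonth", "iday", "country", "country_txt", "region", "region_txt",
    "provstate", "city", "attacktype1", "attacktype1_txt", "targtype1",
    "targtype1_txt", "gname", "motive", "weaptype1", "weaptype1_txt",
    "nkill", "nwound",
]


def subset_columns(df, columns=KEEP_COLUMNS):
    """Return a copy holding only the requested columns that exist in df."""
    present = [c for c in columns if c in df.columns]
    return df[present].copy()


def memory_mb(df):
    """Approximate in-memory size in megabytes, counting string contents."""
    return df.memory_usage(deep=True).sum() / 1e6


# Toy stand-in for the full dataframe, with one column we do not keep.
toy = pd.DataFrame({
    "iyear": [1970, 2014],
    "nkill": [1.0, 3.0],
    "summary": ["long free text ...", "more free text ..."],
})
small = subset_columns(toy)
print(list(small.columns))
print(memory_mb(toy), memory_mb(small))
```

Passing `deep=True` to `memory_usage` matters here because most of the memory in text-heavy columns is held by the strings themselves, not by the column's object pointers.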

The new dataframe is about one-sixth the size of the original, now 15.2MB. The original dataframe contained columns that described sources, validity, detailed text descriptions, and more. With the exception of two unstructured text columns, we have only kept structured data describing the incidents. We break down the columns and what they record in the table below:

Column Name | Variable Name | Data Type | Description
iyear | Year | interval | This field contains the year in which the incident occurred. In the case of incident(s) occurring over an extended period, the field will record the year when the incident was initiated.
imonth | Month | categorical | This field contains the number of the month in which the incident occurred. In the case of incident(s) occurring over an extended period, the field will record the month when the incident was initiated.
iday | Day | interval | This field contains the numeric day of the month on which the incident occurred. In the case of incident(s) occurring over an extended period, the field will record the day when the incident was initiated.
country, country_txt | Country | categorical | This field identifies the country or location where the incident occurred. Separatist regions, such as Kashmir, Chechnya, South Ossetia, Transnistria, or Republic of Cabinda, are coded as part of the “home” country.
region, region_txt | Region | categorical | This field identifies the region in which the incident occurred. The regions are divided into the following 12 categories, which depend on the country coded for the case: North America, Central America & Caribbean, South America, East Asia, Southeast Asia, South Asia, Central Asia, Western Europe, Eastern Europe, Middle East & North Africa, Sub-Saharan Africa, and Australasia & Oceania.
provstate | Province/State | text | This variable records the name (at the time of event) of the 1st order subnational administrative region in which the event occurs.
city | City | text | This field contains the name of the city, village, or town in which the incident occurred. If the city, village, or town for an incident is unknown, then this field contains the smallest administrative area below provstate which can be found for the incident (e.g., district).
attacktype1, attacktype1_txt | Attack Type | categorical | This field captures the general method of attack and often reflects the broad class of tactics used. It consists of nine categories, which are listed here: Assassination, Hijacking, Kidnapping, Barricade Incident, Bombing/Explosion, Armed Assault, Unarmed Assault, Facility/Infrastructure Attack, Unknown.
targtype1, targtype1_txt | Target Type | categorical | The target/victim type field captures the general type of target/victim. When a victim is attacked specifically because of his or her relationship to a particular person, such as a prominent figure, the target type reflects that motive. For example, if a family member of a government official is attacked because of his or her relationship to that individual, the type of target is “government.” This variable consists of 22 categories that can be found in the GTD codebook.
gname | Perpetrator Group Name | unstructured text | This field contains the name of the group that carried out the attack. In order to ensure consistency in the usage of group names for the database, the GTD uses a standardized list of group names that have been established by project staff to serve as a reference for all subsequent entries. In the event that the name of a formal perpetrator group or organization is not reported in source materials, this field may contain relevant information about the generic identity of the perpetrator(s) (e.g., “Protestant Extremists”). Note that these categories do not represent discrete entities. They are not exhaustive or mutually exclusive (e.g., “student radicals” and “left-wing militants” may describe the same people). They also do not characterize the behavior of an entire population or ideological movement. For many attacks, generic identifiers are the only information available about the perpetrators. Because of this they are included in the database to provide context; however, analysis of generic identifiers should be interpreted with caution.
motive | Motive | unstructured text | When reports explicitly mention a specific motive for the attack, this motive is recorded in the “Motive” field. This field may also include general information about the political, social, or economic climate at the time of the attack if considered relevant to the motivation underlying the incident. Note: This field is presently only systematically available with incidents occurring after 1997.
weaptype1, weaptype1_txt | Weapon Type | categorical | This field records the general type of weapon used in the incident. It consists of the following categories: Biological, Chemical, Radiological, Nuclear, Firearms, Explosives, Fake Weapons, Incendiary, Melee, Vehicle, Sabotage Equipment, Other, and Unknown.
nkill | Total Number of Fatalities | ratio | This field stores the number of total confirmed fatalities for the incident. The number includes all victims and attackers who died as a direct result of the incident. Where there is evidence of fatalities, but a figure is not reported or it is too vague to be of use, such as “many” or “some,” this field remains blank.
nwound | Total Number of Injured | ratio | This field records the number of confirmed non-fatal injuries to both perpetrators and victims. It follows the conventions of the “Total Number of Fatalities” field described above.

With the exception of gname and motive, all the variables we have included in the dataframe are structured and well-defined. For additional information on each of the variables and examples of how they are coded, see the GTD codebook. Now that we have our dataframe, we can proceed to some exploratory data analysis.

Exploratory Data Analysis and Visualization

Let's begin by examining the correlation between the variables in our dataset using a heatmap. Note that we need to one-hot encode our categorical variables (that is, create one indicator column per category) before computing correlations; otherwise the corr method will treat the numeric category codes as if they were interval variables.
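As a rough sketch of this encoding step, the snippet below one-hot encodes two text-labelled categorical columns with `pd.get_dummies` and then computes the correlation matrix. The helper name `one_hot_correlation` and the toy rows are made up for illustration and are not the project's actual code or data.

```python
import pandas as pd


def one_hot_correlation(df, categorical, numeric):
    """One-hot encode the categorical columns, then correlate all columns."""
    # Each category becomes its own 0/1 indicator column, e.g.
    # "weaptype1_txt_Firearms", so no ordering is implied between categories.
    dummies = pd.get_dummies(df[categorical], columns=categorical, dtype=float)
    encoded = pd.concat([df[numeric], dummies], axis=1)
    return encoded.corr()  # Pearson correlation by default


# Toy rows for illustration only.
toy = pd.DataFrame({
    "iyear": [1970, 1980, 1990, 2000, 2010, 2015],
    "nkill": [0, 1, 0, 5, 2, 8],
    "attacktype1_txt": ["Bombing/Explosion", "Armed Assault", "Bombing/Explosion",
                        "Armed Assault", "Bombing/Explosion", "Armed Assault"],
    "weaptype1_txt": ["Explosives", "Firearms", "Explosives",
                      "Firearms", "Explosives", "Firearms"],
})
corr = one_hot_correlation(
    toy,
    categorical=["attacktype1_txt", "weaptype1_txt"],
    numeric=["iyear", "nkill"],
)
print(corr.round(2))
# A heatmap could then be drawn with, for example, seaborn.heatmap(corr).
```

One consequence of this encoding is that indicator columns derived from the same variable are negatively correlated with each other by construction, which is worth keeping in mind when reading the matrix.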

The plot above is a correlation matrix that shows the correlation coefficient between each pair of variables in the dataset. Unfortunately, it does not seem as though many variables in our dataset are highly correlated. While our visualization is fun, it is clunky and hard to interpret. Let's examine further by finding all of the pairs of variables whose correlation coefficient is greater than 0.3 or less than -0.3. A common, although arbitrary, convention is that an absolute coefficient of about 0.3 represents a weak correlation, one between 0.3 and 0.7 a moderate correlation, and one of 0.7 or greater a strong correlation.
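One way to extract such pairs is sketched below. It assumes a square correlation matrix like the one returned by `DataFrame.corr`; the helper name `correlated_pairs` and the numbers in the toy matrix are invented for illustration.

```python
import numpy as np
import pandas as pd


def correlated_pairs(corr, threshold=0.3):
    """List each variable pair once where |r| exceeds the threshold."""
    # Keep only the strict upper triangle so that (a, b) and (b, a) are not
    # both listed and the diagonal of self-correlations is dropped.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack()
    # Masked-out cells are NaN; they fail this comparison and are filtered out.
    pairs = pairs[pairs.abs() > threshold]
    # Strongest relationships first, regardless of sign.
    return pairs.sort_values(key=lambda s: s.abs(), ascending=False)


# Made-up correlation matrix for illustration.
names = ["nkill", "nwound", "iyear"]
toy_corr = pd.DataFrame(
    [[1.0, 0.8, 0.1],
     [0.8, 1.0, -0.4],
     [0.1, -0.4, 1.0]],
    index=names,
    columns=names,
)
pairs = correlated_pairs(toy_corr)
print(pairs)
```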

We have 17 pairs of variables that are at least weakly correlated. Let's see whether any of the relationships are artificial or not easily explainable. The first three pairs show a negative correlation between year and three regions in the dataset. This likely points to a decrease in terrorism in Central America & Caribbean, South America, and Western Europe from 1970 through 2019. The fourth pair shows a correlation between fatalities and injuries, which is intuitive: the more people who are injured in an attack, the more likely there are to be fatalities, and vice versa. Pairs six through eight indicate that incidents classified as armed assaults typically involve firearms rather than other types of weapons, which is obvious. Similarly, pairs 10 and 11 show that attack types involving a bombing or explosion correlate very strongly with the weapon type being explosives; once again, this is obvious. Pair 12 shows a strong correlation between incidents classified as attacks on infrastructure and the use of incendiary weapons. This likely just means that the number of arson cases in the GTD is far greater than the number of attacks on people using incendiary weapons. Pair 13 indicates a weak correlation between incidents classified as kidnappings and incidents where the weapon type was unknown. Perplexingly, there is also a correlation between incidents classified as unarmed assaults and the use of chemical weapons. This perhaps has to do with how incidents in the GTD are coded, but it warrants further investigation. Pair 15 shows a strong correlation between incidents where the weapon was unknown and incidents where the attack classification was unknown. This is likely a reflection of gaps in open-source data. Finally, pairs 5, 16, and 17 are artificial correlations: each pair was derived from the same variable, so these correlations are meaningless.

Now that we have examined our correlated pairs, let's investigate some of the less obvious correlations: namely, the correlation between year and region, the correlation between incendiary weapons and attacks on infrastructure, and the correlation between unarmed assaults and the use of chemical weapons.
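One possible way to look more closely at an attack-type and weapon-type pairing is a normalized cross-tabulation, sketched below. The helper name `weapon_share_by_attack` and the toy rows are illustrative assumptions, not results from the GTD.

```python
import pandas as pd


def weapon_share_by_attack(df):
    """For each attack type, the share of incidents using each weapon type."""
    # normalize="index" makes every row sum to 1, so rows are comparable
    # even when attack types differ greatly in frequency.
    return pd.crosstab(df["attacktype1_txt"], df["weaptype1_txt"], normalize="index")


# Toy rows for illustration only.
toy = pd.DataFrame({
    "attacktype1_txt": ["Unarmed Assault", "Unarmed Assault",
                        "Armed Assault", "Armed Assault"],
    "weaptype1_txt": ["Chemical", "Melee", "Firearms", "Firearms"],
})
shares = weapon_share_by_attack(toy)
print(shares)
```

Reading the table row by row shows which weapon types dominate a given attack type, which can help separate a coding convention from a substantive pattern.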

Let's begin by plotting some general information about terrorist attacks over time, such as the number of attacks per year and the number of casualties per year.
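The yearly aggregation behind such plots might be sketched as follows. The helper name `yearly_totals` is illustrative, and treating "casualties" as fatalities plus injuries is an assumption made here, since the term is not defined precisely in the text.

```python
import pandas as pd


def yearly_totals(df):
    """Count incidents and sum fatalities and injuries for each year."""
    totals = df.groupby("iyear").agg(
        attacks=("nkill", "size"),   # size counts rows, including blank nkill
        killed=("nkill", "sum"),     # sum skips missing values
        wounded=("nwound", "sum"),
    )
    # Assumption: casualties = fatalities + non-fatal injuries.
    totals["casualties"] = totals["killed"] + totals["wounded"]
    return totals


# Toy rows for illustration only; None marks a blank field.
toy = pd.DataFrame({
    "iyear": [2013, 2013, 2014, 2014, 2014],
    "nkill": [1, None, 4, 0, 2],
    "nwound": [0, 3, 10, None, 1],
})
totals = yearly_totals(toy)
print(totals)
# totals.plot(subplots=True) would draw one line chart per column
# (this requires matplotlib).
```

Because blank nkill and nwound fields are skipped by the sums, yearly totals computed this way are lower bounds on the reported counts.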

Generally, we see attacks and casualties rising and falling over time, with a notable spike in 2014 followed by a rapid decline. The spike appears in both the number of attacks per year and the number of casualties per year. It coincides with the formation of the Islamic State of Iraq and Syria (ISIS) in 2013 and its declaration of a caliphate in 2014.

Let's recreate the plot above, for each of the regions in the dataset:
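A per-region version of the yearly counts can be sketched as a year-by-region table. The helper name `attacks_by_region` and the toy rows are illustrative.

```python
import pandas as pd


def attacks_by_region(df):
    """Rows are years, columns are regions, values are incident counts."""
    return (
        df.groupby(["iyear", "region_txt"])
        .size()
        .unstack(fill_value=0)  # years with no incidents in a region become 0
    )


# Toy rows for illustration only.
toy = pd.DataFrame({
    "iyear": [2013, 2013, 2014, 2014],
    "region_txt": ["South Asia", "Western Europe", "South Asia", "South Asia"],
})
counts = attacks_by_region(toy)
print(counts)
# counts.plot(subplots=True) would draw one panel per region (requires matplotlib).
```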

Let's continue, by looking at a violin plot of attacks per year by region.
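A violin plot of this kind needs one row per year and region. The sketch below prepares that long-form table with pandas and wraps the plotting call in a function that uses seaborn; both helper names are illustrative, and the use of seaborn is an assumption, since the text does not say which plotting library was used.

```python
import pandas as pd


def attacks_long_form(df):
    """One row per (year, region) with the number of incidents, zeros included."""
    wide = df.groupby(["iyear", "region_txt"]).size().unstack(fill_value=0)
    return wide.stack().rename("attacks").reset_index()


def plot_attack_violins(long_df):
    """Draw the distribution of yearly attack counts per region."""
    import seaborn as sns  # imported lazily so the data prep works without it

    # cut=0 stops the density estimate from extending below the observed
    # minimum, which avoids suggesting negative attack counts.
    ax = sns.violinplot(data=long_df, x="region_txt", y="attacks", cut=0)
    ax.tick_params(axis="x", labelrotation=90)
    return ax


# Toy rows for illustration only.
toy = pd.DataFrame({
    "iyear": [2013, 2013, 2014],
    "region_txt": ["South Asia", "South Asia", "Western Europe"],
})
long_df = attacks_long_form(toy)
print(long_df)
# plot_attack_violins(long_df)  # uncomment when seaborn is installed
```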

Some commentary here

Analysis, Hypothesis Testing, and Machine Learning

Insight and Policy Decision